speech synthesizer
Voice Conversion with Diverse Intonation using Conditional Variational Auto-Encoder
Suh, Soobin, Ahn, Dabi, Park, Heewoong, Park, Jonghun
Voice conversion is a task of synthesizing an utterance with target speaker's voice while maintaining linguistic information of the source utterance. While a speaker can produce varying utterances from a single script with different intonations, conventional voice conversion models were limited to producing only one result per source input. To overcome this limitation, we propose a novel approach for voice conversion with diverse intonations using conditional variational autoencoder (CVAE). Experiments have shown that the speaker's style feature can be mapped into a latent space with Gaussian distribution. We have also been able to convert voices with more diverse intonation by making the posterior of the latent space more complex with inverse autoregressive flow (IAF). As a result, the converted voice not only has a diversity of intonations, but also has better sound quality than the model without CVAE.
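The abstract above describes mapping a style feature into a Gaussian latent space, so that different latent draws yield different intonations for the same source input. The following is a minimal sketch of the two generic building blocks such a model relies on: reparameterized sampling from a diagonal Gaussian posterior, and the closed-form KL term against a standard normal prior. It is not the authors' implementation. The function names `sample_latent` and `kl_to_standard_normal`, the two-dimensional latent, and the standard normal prior are assumptions for illustration, and the IAF step is omitted.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) (reparameterization trick).

    mu and log_var are per-dimension posterior parameters; in a real model
    they would come from an encoder network (assumed, not shown here).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.5, -1.0], [0.0, 0.0]  # hypothetical posterior parameters
    # Each draw would condition the decoder on a different style vector.
    for _ in range(3):
        print(sample_latent(mu, log_var, rng))
    print("KL term:", kl_to_standard_normal(mu, log_var))
```

At conversion time, drawing several latent vectors for one source utterance is what would give several candidate intonations. The KL term is the regularizer that keeps the posterior close to the prior during training.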
SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation
Yu, Wenyi, Wang, Siyin, Yang, Xiaoyu, Chen, Xianzhao, Tian, Xiaohai, Zhang, Jun, Sun, Guangzhi, Lu, Lu, Wang, Yuxuan, Zhang, Chao
Full-duplex multimodal large language models (LLMs) provide a unified framework for addressing diverse speech understanding and generation tasks, enabling more natural and seamless human-machine conversations. Unlike traditional modularised conversational AI systems, which separate speech recognition, understanding, and text-to-speech generation into distinct components, multimodal LLMs operate as single end-to-end models. This streamlined design eliminates error propagation across components and fully leverages the rich non-verbal information embedded in input speech signals. We introduce SALMONN-omni, a codec-free, full-duplex speech understanding and generation model capable of simultaneously listening to its own generated speech and background sounds while speaking. To support this capability, we propose a novel duplex spoken dialogue framework incorporating a "thinking" mechanism that facilitates asynchronous text and speech generation relying on embeddings instead of codecs (quantized speech and audio tokens). Experimental results demonstrate SALMONN-omni's versatility across a broad range of streaming speech tasks, including speech recognition, speech enhancement, and spoken question answering. Additionally, SALMONN-omni excels at managing turn-taking, barge-in, and echo cancellation scenarios, establishing its potential as a robust prototype for full-duplex conversational AI systems. To the best of our knowledge, SALMONN-omni is the first codec-free model of its kind. A full technical report along with model checkpoints will be released soon.
Reviews: Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
This work offers a clearly defined extension to TTS systems that allows building good-quality voices (even ones unseen during training of either component) from a few adaptation data points. The authors do not seem to offer any truly new theoretical extension to the "building blocks" of their system, which is based on known components proposed elsewhere (speaker encoder, synthesizer and vocoder are based on previously published models). However, their mutual combination is clever and well-engineered, and it allows the building blocks to be independently estimated in either unsupervised (speaker encoder, where audio transcripts are not needed) or supervised (speech synthesizer) ways, on different corpora. This allows for greater flexibility while reducing the requirements for large amounts of transcribed data for each of the components (i.e. Good points: - clear, fair and convincing experiments - trained and evaluated on public corpora, which greatly increases reproducibility (a portion of the experiments is carried out on proprietary data, but all have equivalent experiments constrained to publicly available data) Weak points: - it would probably make sense to investigate the additional adaptability in case one gets more data per speaker; it seems your system cannot easily leverage more than 10s of reference speech data Summary: this is a very good study on generating multi-speaker TTS systems from small amounts of target speaker data.
HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis
Lee, Sang-Hoon, Choi, Ha-Yeong, Kim, Seung-Bin, Lee, Seong-Whan
Large language model (LLM)-based speech synthesis has been widely adopted in zero-shot speech synthesis. However, such models require large-scale data and possess the same limitations as previous autoregressive speech models, including slow inference speed and lack of robustness. This paper proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC). We verified that hierarchical speech synthesis frameworks could significantly improve the robustness and expressiveness of the synthetic speech. Furthermore, we significantly improve the naturalness and speaker similarity of synthetic speech even in zero-shot speech synthesis scenarios. For text-to-speech, we adopt the text-to-vec framework, which generates a self-supervised speech representation and an F0 representation based on text representations and prosody prompts. Then, HierSpeech++ generates speech from the generated vector, F0, and voice prompt. We further introduce a highly efficient speech super-resolution framework from 16 kHz to 48 kHz. The experimental results demonstrated that the hierarchical variational autoencoder could be a strong zero-shot speech synthesizer given that it outperforms LLM-based and diffusion-based models. Moreover, we achieved the first human-level quality zero-shot speech synthesis. Audio samples and source code are available at https://github.com/sh-lee-prml/HierSpeechpp.
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
Wang, Chengyi, Chen, Sanyuan, Wu, Yu, Zhang, Ziqiang, Zhou, Long, Liu, Shujie, Chen, Zhuo, Liu, Yanqing, Wang, Huaming, Li, Jinyu, He, Lei, Zhao, Sheng, Wei, Furu
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems. Vall-E exhibits emergent in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experimental results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find Vall-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.
Disabled lawmaker first in Japan to use speech synthesizer during Diet session
A lawmaker with severe physical disabilities attended his first parliamentary interpellation Thursday since being elected in July and became the first lawmaker in Japan ever to use an electronically-generated voice during a Diet session. In the session of the education, culture and science committee, Yasuhiko Funago, who has amyotrophic lateral sclerosis, a condition also known as Lou Gehrig's disease, greeted the committee using a speech synthesizer. He also asked questions through a proxy speaker. "As a newcomer, I am still inexperienced, but with everyone's assistance, I will do my best to tackle (issues)," he said at the beginning of the session. An aide then posed questions on his behalf and expressed his desire to see improvements in the learning environment for disabled children.
A Fully Time-domain Neural Model for Subband-based Speech Synthesizer
Rabiee, Azam, Kim, Geonmin, Kim, Tae-Ho, Lee, Soo-Young
This paper introduces a deep neural network model for a subband-based speech synthesizer. The model benefits from the narrow bandwidth of the subband signals to reduce the complexity of the time-domain speech generator. We employed multi-level wavelet analysis/synthesis to decompose/reconstruct the signal into subbands in the time domain. Inspired by WaveNet, a convolutional neural network (CNN) model predicts subband speech signals fully in the time domain. Due to the narrow bandwidth of the subbands, a simple network architecture is enough to learn the simple patterns of the subbands accurately. In the ground truth experiments with teacher-forcing, the subband synthesizer outperforms the fullband model significantly in terms of both subjective and objective measures. In addition, by conditioning the model on the phoneme sequence using a pronunciation dictionary, we have achieved a fully time-domain neural model for a subband-based text-to-speech (TTS) synthesizer, which is nearly end-to-end. The generated speech of the subband TTS shows quality comparable to the fullband one with a lighter network architecture for each subband.
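To make the multi-level wavelet analysis/synthesis step in the abstract above concrete, here is a minimal sketch that splits a signal into subbands in the time domain and reconstructs it exactly. The abstract does not say which wavelet or decomposition tree the authors used. The Haar wavelet, the choice to split only the low band at each level, and the function names are all assumptions made for illustration.

```python
import math

_S = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling factor


def haar_analysis(signal):
    """Split an even-length signal into half-rate low and high subbands."""
    low = [(signal[i] + signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    return low, high


def haar_synthesis(low, high):
    """Invert haar_analysis: interleave the recovered sample pairs."""
    out = []
    for l, h in zip(low, high):
        out.append((l + h) * _S)
        out.append((l - h) * _S)
    return out


def decompose(signal, levels):
    """Multi-level analysis; returns [low_L, high_L, ..., high_1]."""
    if levels < 1 or len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be a multiple of 2**levels")
    low, highs = list(signal), []
    for _ in range(levels):
        low, high = haar_analysis(low)
        highs.append(high)
    return [low] + highs[::-1]


def reconstruct(subbands):
    """Multi-level synthesis, merging from the coarsest level upward."""
    low = subbands[0]
    for high in subbands[1:]:
        low = haar_synthesis(low, high)
    return low


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    bands = decompose(x, levels=2)
    print([len(b) for b in bands])  # subband lengths at each level
    print(reconstruct(bands))       # matches x up to rounding error
```

Each subband runs at a fraction of the original sample rate, which is the property the abstract relies on when it says a simpler network suffices per subband. In a subband synthesizer, a separate generator would predict each band and the synthesis step would merge the predictions back into a fullband waveform.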
Brain implants, AI, and a speech synthesizer have turned brain activity into robot words
Neural networks have been used to turn words that a human has heard into intelligible, recognizable speech. It could be a step toward technology that can one day decode people's thoughts. A challenge: Thanks to fMRI scanning, we've known for decades that when people speak, or hear others, it activates specific parts of their brain. However, it's proved hugely challenging to translate thoughts into words. A team from Columbia University has developed a system that combines deep learning with a speech synthesizer to do just that.
New AI Mimics Any Voice in a Matter of Minutes
The story starts out like a bad joke: Obama, Clinton and Trump walk into a bar, where they applaud a new startup based in Montreal, Canada called Lyrebird. If the scenario seems too bizarre to be real, you're right: it's not. The entire recording was generated by a new AI with the ability to mimic natural conversation, at a rate much faster than any previous speech synthesizer. From there, it adds an extra layer of emotion or special intonation, until it nails a person's voice, tone and accent, be it Obama, Trump or even you. While Lyrebird still retains a slight but noticeable robotic buzz characteristic of machine-generated speech, add some smartly-placed background noise to cover up the distortion, and the recordings could pass as genuine to unsuspecting ears.
Glove-TalkII: Mapping Hand Gestures to Speech Using Neural Networks
Fels, Sidney, Hinton, Geoffrey E.
Glove-TalkII is a system which translates hand gestures to speech through an adaptive interface. Hand gestures are mapped continuously to 10 control parameters of a parallel formant speech synthesizer. The mapping allows the hand to act as an artificial vocal tract that produces speech in real time. This gives an unlimited vocabulary in addition to direct control of fundamental frequency and volume. Currently, the best version of Glove-TalkII uses several input devices (including a CyberGlove, a ContactGlove, a 3-space tracker, and a foot-pedal), a parallel formant speech synthesizer and 3 neural networks.